Stereoelectroencephalography (SEEG) is a technique used in patients with drug-resistant epilepsy who may be candidates for surgical resection of the epileptogenic zone. Multiple electrodes are placed using a so-called "frame-based" stereotactic approach, in our case using the Leksell frame. In our previous paper, "Methodology, outcome, safety and in vivo accuracy in traditional frame-based stereoelectroencephalography" by Van der Loo et al. (2017), we reported on SEEG electrode implantation accuracy in a cohort of 71 patients operated on between September 2008 and April 2016, in whom a total of 902 electrodes were implanted. Data for in vivo application accuracy analysis were available for 866 electrodes.
The goal of the current project is to use a public version of this dataset (without any personal identifiers) to predict electrode implantation accuracy by using and comparing different machine learning approaches.
Pieter Kubben, MD, PhD
neurosurgeon @ Maastricht University Medical Center, The Netherlands
The public dataset contains these variables:
The electrodes are the Microdeep depth electrodes by DIXI Medical.
To the limited extent possible in this case, I tried to make these FAIR data and adhere to the FAIR guiding principles. In practice this meant I introduced the topic, described my data and created a DOI.
Now let's get started.
In [169]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
import warnings; warnings.simplefilter('ignore')
# shorter error messages
%xmode plain
# global setting whether to save figures or not
# will save as 300 dpi PNG - all filenames start with "fig_"
save_figures = False
In [170]:
# load data
electrodes = pd.read_csv('electrodes_public.csv')
# find missing values
nan_rows = electrodes.isnull().any(axis=1).sum()
print('Nr of rows with missing values:', nan_rows)
We will calculate the target point localization error (TPLE) using the Euclidean distance, and remove the entry data as we won't be using them (I still wanted to share them in the dataset though).
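Concretely, if $(x_p, y_p, z_p)$ is the planned target and $(x_t, y_t, z_t)$ the actual electrode tip, then

$$\mathrm{TPLE} = \sqrt{(x_t - x_p)^2 + (y_t - y_p)^2 + (z_t - z_p)^2}$$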
In [171]:
# calculate TPLE and remove entry data from dataframe
electrodes['TPLE'] = np.sqrt(np.square(electrodes['TipX'] - electrodes['PlanningX']) +
                             np.square(electrodes['TipY'] - electrodes['PlanningY']) +
                             np.square(electrodes['TipZ'] - electrodes['PlanningZ'])
                             ).round(1)
electrodes.drop(['EntryX', 'EntryY', 'EntryZ'], axis = 1, inplace = True)
electrodes.head()
Out[171]:
Now we will remove large outliers (difference between planned and real coordinates) in the Z-axis, as electrode insertion length (depth) is also influenced by other factors: depth calculations that can leave the electrode too superficial or too deep, but also possible malfixation of the screw cap, which may loosen the electrode and hence cause a more superficial position (it won't migrate deeper spontaneously). These cases are very limited in number and would influence further analysis too much.
In [173]:
# check for outliers in Z axis
large_depth_error = electrodes[np.abs(electrodes['PlanningZ'] - electrodes['TipZ']) > 10]
print('Outliers in Z axis (> 10mm):\n\n', large_depth_error['TPLE'])
# remove outliers
electrodes.drop(large_depth_error.index, inplace = True)
print('\nNew dataframe shape:', electrodes.shape) # removed 6 rows
We need to structure our data properly for further analysis and convert the categorical variables (nominal, ordinal) to the category type.
In [174]:
# convert categorical columns to "category" dtype
catcols = ['PatientPosition', 'Contacts', 'ElectrodeType', 'ScrewLength']
for cat in catcols:
    electrodes[cat] = electrodes[cat].astype('category')
# confirm correct types for all columns now
electrodes.dtypes
Out[174]:
Let's get a short description of our TPLE data.
In [175]:
# get summary data on TPLE
tple = electrodes['TPLE']
tple.describe().round(1)
Out[175]:
We now have the data in the right format, but for classification we need to bin the continuous outcome variable TPLE into categories. Alternatively we could approach this as a regression problem, but given the relatively limited amount of data, classification should lead to a better prediction model that is still relevant for potential clinical use. Let's create a new variable TPLE category for this purpose.
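As a minimal sketch of what this binning could look like with pd.cut (the name tple_category and the 2 mm / 4 mm cutoffs are illustrative assumptions, not the categories used in the paper):

bins = [0, 2, 4, np.inf]  # NOTE: cutoffs are illustrative assumptions
labels = ['small', 'medium', 'large']
tple_category = pd.cut(electrodes['TPLE'], bins=bins, labels=labels,
                       include_lowest=True)  # include TPLE == 0 in the first bin
tple_category.value_counts()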
I added some code here to do regression using a basic neural network from the scikit-learn package. > I am not sure I get your point about the purpose of this prediction. Pedro and I discussed this week, and we listed 3 possible purposes of this prediction:
1. Predict the real Cartesian X, Y, Z coordinates of the target point: for optimizing the planned coordinates (the doctor can change the plan before surgery)
2. Predict TPLE (the error between the planned and real target point): for predicting human error (the doctor will not change the plan, but can decide whether to perform the surgery according to the predicted error)
3. Find the characteristics of the surgery: for figuring out which features cause errors.
However, TipX, TipY and TipZ cannot be regarded as training features in any case; they should be target (predicted) features. So, no matter whether we do classification or regression, TipX, TipY and TipZ need to be removed from the training matrix.
I tried one basic neural network from scikit-learn to predict TipX, TipY and TipZ. This model is assumed to be used before the surgery: doctors input their planned coordinates, patient position, electrode type and other features, and get the predicted coordinates of the target point from the model (a usage sketch follows the cross-validation cell below).
In [176]:
from sklearn import neighbors
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
In [177]:
print("Number of missing value in PlanningRing:", Counter(electrodes[ 'PlanningRing'].isnull())[True])
print("Number of missing value in PlanningArc:", Counter(electrodes[ 'PlanningArc'].isnull())[True])
print("Number of missing value in DuraTipDistancePlanned:", Counter(electrodes[ 'DuraTipDistancePlanned'].isnull())[True])
print("Number of missing value in ScrewLength:", Counter(electrodes[ 'ScrewLength'].isnull())[True])
print("Number of missing value in Contacts:", Counter(electrodes['Contacts'].isnull())[True])
To keep it simple, I am going to use complete cases instead of filling in the missing values. Among the features, ScrewLength has a lot of missing values, so I am not going to use it in my model, as it would hurt the quality of the regression model. I will remove the rows with missing values in DuraTipDistancePlanned and Contacts. Then I will show results both with and without PlanningRing and PlanningArc.
In [178]:
# set PlanningRing/PlanningArc aside, drop them plus ScrewLength,
# then keep only rows that are complete for the remaining columns
PlanningRing = electrodes['PlanningRing']
PlanningArc = electrodes['PlanningArc']
electrodes = electrodes.drop(['PlanningRing', 'PlanningArc', 'ScrewLength'], axis=1)
complete = ~electrodes.isnull().any(axis=1)
new_electrodes = electrodes[complete]
PlanningRing = PlanningRing[complete]
PlanningArc = PlanningArc[complete]
# re-attach the ring/arc columns (these may still contain missing values)
new_electrodes = new_electrodes.assign(PlanningRing=PlanningRing.values)
new_electrodes = new_electrodes.assign(PlanningArc=PlanningArc.values)
print('Number of instances:', len(new_electrodes))
In [179]:
# transform the strings in PatientPosition and ElectrodeType into numbers
# to make them usable as numeric features; 0 marks unknown/missing values
pos_map = {'Prone': 1, 'Supine': 2}
type_map = {'Oblique': 1, 'Orthogonal': 2}
rep_pat = new_electrodes['PatientPosition'].astype(object).map(pos_map).fillna(0).astype(int)
rep_ele = new_electrodes['ElectrodeType'].astype(object).map(type_map).fillna(0).astype(int)
new_electrodes = new_electrodes.drop(['PatientPosition', 'ElectrodeType'], axis=1)
new_electrodes = new_electrodes.assign(PatientPosition=rep_pat.values)
new_electrodes = new_electrodes.assign(ElectrodeType=rep_ele.values)
You can see from the following figure that "4. Supine position + Orthogonal electrode type" has the best performance in these surgeries, as it has the lowest TPLE. "3. Supine position + Oblique electrode type" has a very concentrated TPLE distribution, which means that in this case doctors consistently get this difference between their planned and real points. Additionally, "2. Prone position + Orthogonal electrode type" is not stable: doctors easily get a large TPLE (> 5 mm).
In [180]:
ObliqueType = new_electrodes[new_electrodes.ElectrodeType == 1]
OrthogonalType = new_electrodes[new_electrodes.ElectrodeType == 2]
PronePosition = new_electrodes[new_electrodes.PatientPosition == 1]
SupinePosition = new_electrodes[new_electrodes.PatientPosition == 2]
plt.figure(figsize=[16, 8])
# sns.kdeplot(drop_target['TPLE'], label='all')
sns.kdeplot(PronePosition[PronePosition['ElectrodeType']==1]['TPLE'], label='Prone--Oblique')
sns.kdeplot(PronePosition[PronePosition['ElectrodeType']==2]['TPLE'], label='Prone--Orthogonal')
sns.kdeplot(SupinePosition[SupinePosition['ElectrodeType']==1]['TPLE'], label='Supine--Oblique')
sns.kdeplot(SupinePosition[SupinePosition['ElectrodeType']==2]['TPLE'], label='Supine--Orthogonal')
Out[180]:
Here I did two experiments based on two conditions. One: I removed the PlanningRing and PlanningArc features because they include too many missing values, but kept all instances (electrodes).
In [181]:
# condition one: keep all instances, drop PlanningRing/PlanningArc
mis_target = new_electrodes['TPLE']
mis_target_X = new_electrodes['TipX']
mis_target_Y = new_electrodes['TipY']
mis_target_Z = new_electrodes['TipZ']
mis_electrodes = new_electrodes.drop(['PlanningRing', 'PlanningArc', 'TipX', 'TipY', 'TipZ', 'TPLE'], axis=1)
Two: I removed all instances (electrodes) that have missing values, but kept the PlanningRing and PlanningArc features.
In [182]:
# condition two: keep PlanningRing/PlanningArc, drop incomplete instances
cpt_electrodes = new_electrodes[~new_electrodes.isnull().any(axis=1)]
cpt_target = cpt_electrodes['TPLE']
cpt_target_X = cpt_electrodes['TipX']
cpt_target_Y = cpt_electrodes['TipY']
cpt_target_Z = cpt_electrodes['TipZ']
cpt_electrodes = cpt_electrodes.drop(['TipX', 'TipY', 'TipZ', 'TPLE'], axis=1)
I predicted TipX, TipY and TipZ and used R² and explained variance as evaluation scores; the best possible value is 1.0. These links explain R² and explained variance:
R²: https://en.wikipedia.org/wiki/Coefficient_of_determination
Explained variance: https://en.wikipedia.org/wiki/Explained_variation
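As a quick toy illustration of the two scores (the numbers are made up purely for demonstration):

from sklearn.metrics import explained_variance_score

y_true = [3.0, -0.5, 2.0, 7.0]  # made-up ground truth
y_pred = [2.5, 0.0, 2.0, 8.0]   # made-up predictions
print('R2:', r2_score(y_true, y_pred))                                  # ~0.949
print('Explained variance:', explained_variance_score(y_true, y_pred))  # ~0.957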
In [183]:
train = cpt_electrodes  # or: mis_electrodes
target = np.transpose([cpt_target_X, cpt_target_Y, cpt_target_Z])
nn = MLPRegressor(activation='identity', solver='lbfgs', max_iter=1000,
                  random_state=6, hidden_layer_sizes=20)
# alternative scoring: 'explained_variance', 'neg_mean_squared_error'
cvscore = cross_val_score(nn, train, target, scoring='r2', cv=10)
# print('10-fold cross-validation score:', cvscore)
print('Ave: ', np.mean(cvscore))
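To illustrate the pre-surgery use case described above, here is a minimal usage sketch: fit the same model on all complete cases and query it for a single trajectory. The query simply reuses the first training row as a stand-in for a new plan, so it demonstrates the mechanics only, not a real prospective prediction:

# illustration only: fit on all complete cases, then predict the tip
# position for one (stand-in) planned trajectory
nn.fit(train, target)
new_plan = train.iloc[[0]]            # stand-in for a new planned electrode
predicted_tip = nn.predict(new_plan)  # predicted (TipX, TipY, TipZ)
print('Predicted tip coordinates:', predicted_tip.round(1))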
In [ ]:
# # Fit regression model (requires X_train/X_test, y_train/y_test splits)
# n_neighbors = 12
# err = []
# for i, weights in enumerate(['distance']):  # 'uniform',
#     knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
#     y_pred = knn.fit(X_train, y_train).predict(X_test)
#     for i in range(0, len(y_test)):
#         print(list(y_test)[i], list(y_pred)[i], list(y_test)[i] - list(y_pred)[i])
#         err.append(list(y_test)[i] - list(y_pred)[i])
#     # print('MAE', mean_absolute_error(y_test, y_pred))
#     print('MSE', mean_squared_error(y_test, y_pred))
# np.sqrt(mean_squared_error(y_test, y_pred))
In [133]:
# temp = np.transpose([list(X_test['PlanningX']), list(X_test['PlanningY']), list(X_test['PlanningZ'])])
# com = pd.DataFrame.from_records(temp)
# com.columns = ['planX', 'planY', 'planZ']
# com = com.assign(TipX=pd.Series(y_test[:, 0]).values)
# com = com.assign(TipY=pd.Series(y_test[:, 1]).values)
# com = com.assign(TipZ=pd.Series(y_test[:, 2]).values)
# com = com.assign(predX=pd.Series(y_pred[:, 0]).values)
# com = com.assign(predY=pd.Series(y_pred[:, 1]).values)
# com = com.assign(predZ=pd.Series(y_pred[:, 2]).values)
# pred_point = np.sqrt(np.square(com['predX'] - com['planX']) +
#                      np.square(com['predY'] - com['planY']) +
#                      np.square(com['predZ'] - com['planZ'])
#                      ).round(1)
# tip_point = np.sqrt(np.square(com['TipX'] - com['planX']) +
#                     np.square(com['TipY'] - com['planY']) +
#                     np.square(com['TipZ'] - com['planZ'])
#                     ).round(1)
# com = com.assign(pred_point=pd.Series(pred_point).values)
# com = com.assign(tip_point=pd.Series(tip_point).values)
# print('MSE', mean_squared_error(tip_point, pred_point))
# com
In [134]:
# from sklearn.tree import ExtraTreeRegressor
# from collections import Counter
# # Build a forest and compute the feature importances
# forest = ExtraTreeRegressor()
# forest.fit(train, target)
# importances = forest.feature_importances_
# # std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
# indices = np.argsort(importances)[::-1]
# # Print the feature ranking
# print("Feature ranking:")
# for f in range(train.shape[1]):
#     print("%d. feature %d %s (%f)" % (f + 1, indices[f], train.columns[indices[f]], importances[indices[f]]))
# # Plot the feature importances of the forest
# plt.figure()
# plt.title("Feature importances")
# plt.bar(range(train.shape[1]), importances[indices],
#         color="r", align="center")
# plt.xticks(range(train.shape[1]), indices)
# plt.xlim([-1, train.shape[1]])
# plt.show()
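For completeness, here is a runnable variant of the cell above. Note that ExtraTreeRegressor from sklearn.tree is a single tree with no estimators_ attribute (presumably why the std line is commented out); the ensemble ExtraTreesRegressor from sklearn.ensemble gives the "forest" the comment describes. This sketch reuses train and target from In [183]:

from sklearn.ensemble import ExtraTreesRegressor

# fit an ensemble of extremely randomized trees and rank the features
# by impurity-based importance (reuses train/target from In [183])
forest = ExtraTreesRegressor(n_estimators=100, random_state=6)
forest.fit(train, target)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for rank, idx in enumerate(indices, start=1):
    print("%d. %s (%.4f)" % (rank, train.columns[idx], importances[idx]))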